Emanuel Parzen
Department  of  Statistics 
Texas  A&M  University 
College  Station,  Texas  77843-3143  USA 


1.  Introduction 

This paper on statistical data modeling is written to express my esteem for Professor Hirotugu Akaike, whose 65th birthday was celebrated by a U.S./Japan Conference on The Frontiers of Statistical Modeling: An Informational Approach (held May, 1992 at the University of Tennessee). I like to play the game “how long have you known Professor Akaike” because I have had the good fortune of knowing him since 1965.

Parzen (1979) argues that statistical data analysis should be defined as fitting probability models to data. This paper presents typical concepts and recent results of our modeling theory, which emphasizes quantile domain functions, information measures, and comparison density estimation. Ultimate goals include: unify parametric and nonparametric inference for continuous and discrete data; demonstrate that mathematical statistical and data analytic approaches are both needed for statistical inference; stimulate exoteric methods (applicable by applied researchers) rather than esoteric methods (known only to a small group of mathematical statisticians); combine mathematical statistical and data analytic views to develop methods of statistical analysis which are based on assumptions (a known model) that are tested in ways that provide insight into how to model deviations of the data from the assumed model (and thus identify a “true” model as an “iterated” model).

Contents of the paper are: 1. Introduction; 2. Quantile domain functions: $F$, $Q$; 3. Mid-distribution and quantile functions: $F^{mid}$, $Q^{mid}$; 4. Sample distribution and quantile: $\tilde F$, $\tilde Q$; 5. Comparison distribution and comparison density for continuous $F$, $G$; 6. Comparison distribution and comparison density for discrete $F$, $G$; 7. Information measures: Renyi, Chi-square; 8. Information for comparison density functions; 9. Information measures and entropy tests of fit; 10. Continuous versions of discrete distribution functions: $F^c$, $Q^c$; 11. Comparison distributions of one sample (continuous data); 12. One sample parameter estimation; 13. Comparison distributions of two samples (continuous data); 14. Two sample comparison density estimation.

2.  Quantile  domain  functions:  F,  Q 

The  probability  law  of  a  random  variable  Y  is  described  by  its  true  distribution 
function 

$$F(y) = \mathrm{Prob}[Y \le y], \quad -\infty < y < \infty,$$

and true quantile function

$$Q(u) = F^{-1}(u) = \inf\{y : F(y) \ge u\}.$$

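The most famous parametric model $F(y; \theta)$ is the Normal Distribution:

$$F(y; \mu, \sigma) = \Phi((y - \mu)/\sigma),$$

$$\Phi(y) = \int_{-\infty}^{y} \phi(x)\,dx,$$

$$\phi(x) = (2\pi)^{-.5} \exp(-.5x^2),$$

$$Q(u; \mu, \sigma) = \mu + \sigma\,\Phi^{-1}(u).$$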

Every  random  variable  Y  has  a  probability  mass  function,  defined 

p(y)  =  Prob[Y  =  y]. 

One  can  define  p(y)  analytically  in  terms  of  the  distribution  function  F(y)  as  the 
jump  in  F  at  y.  Similarly  the  spacing  function  of  Y ,  denoted  sp(u),  is  defined  as 
the  jump  in  the  quantile  function  Q  at  u;  the  jump  at  u  is  the  difference  between 
the  right  hand  and  left  hand  limits  at  u. 

Continuous random variables obey $p(y) = 0$ for all $y$. Discrete random variables obey

$$\sum_{\text{all } y} p(y) = 1.$$

We  call  a  random  variable  bi-continuous  if  both  p  and  sp  are  identically  zero.  One 
of  our  goals  is  to  unify  data  analysis  for  continuous  and  discrete  data. 

An (absolutely) continuous random variable Y has a distribution function which is determined by its probability density function $f(y) = F'(y)$. It is bi-continuous if $f(y) > 0$ for all $y$ satisfying $0 < F(y) < 1$.

For a continuous random variable the density quantile function is defined by

$$fQ(u) = f(Q(u)).$$

Then $Q(u)$ has a quantile density $q(u) = Q'(u)$ satisfying

$$q(u) = 1/fQ(u), \quad fQ(u)\,q(u) = 1.$$

To prove this important formula verify that the indefinite integral of $1/fQ(u)$ equals $Q(u)$, or differentiate the identity $F(Q(u)) = u$, which holds for continuous F.

Quantile data analysis implements methods of Data Analysis in the distribution F and the quantile Q domains. Parallel functions in the two domains are: $F$, $Q$; $p$, $sp$; $f$, $q$. Important interpretations are given by $\log f$ and $\log q$. The inverse (reciprocal) of $f$ is not used, but the inverse of $q$ is important and is given by $fQ$.

Three general properties of quantile functions are:

$F(Q(u)) = u$ if F is continuous at $y = Q(u)$;

$F(y) \ge u$ if and only if $Q(u) \le y$;

$Q_{g(Y)}(u) = g(Q_Y(u))$ if $g$ is a function with the mathematical properties of a quantile function: non-decreasing and left-continuous.

Two important applications of quantile functions are concerned with transforming Y to and from a random variable U which is Uniform[0,1]:

Y and $Q(U)$ are identically distributed (since $Q_{Q(U)}(u) = Q(Q_U(u)) = Q(u)$);

$F(Y)$ and U are identically distributed if F is continuous (since $Q_{F(Y)}(u) = F(Q(u)) = u$).
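The two transformations can be checked numerically. The following is a minimal sketch, assuming a standard Normal F; the choice of distribution, the sample size, and the seed are illustrative assumptions and are not taken from the paper. It uses only `statistics.NormalDist` from the Python standard library.

```python
import random
from statistics import NormalDist, mean

F = NormalDist(mu=0.0, sigma=1.0)   # assumed model: standard Normal
rng = random.Random(0)              # fixed seed, for reproducibility only

# Y = Q(U): push Uniform draws through the quantile function.
# rng.random() lies in [0, 1); an exact 0 is dropped because inv_cdf needs 0 < u < 1.
u = [x for x in (rng.random() for _ in range(20000)) if x > 0.0]
y = [F.inv_cdf(ui) for ui in u]

# W = F(Y): the probability integral transformation returns to the unit interval.
w = [F.cdf(yi) for yi in y]

mean_y = mean(y)   # expected to be near 0 for a standard Normal
mean_w = mean(w)   # expected to be near .5 for a Uniform[0,1]
max_roundtrip_error = max(abs(wi - ui) for wi, ui in zip(w, u))
```

Because F is continuous here, F(Q(u)) reproduces u up to rounding, which is what `max_roundtrip_error` measures.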

3. Mid-distribution and quantile functions: $F^{mid}$, $Q^{mid}$

We define several versions of the distribution function which we believe should play important roles in statistical data analysis. Versions of F and Q which we believe should be used routinely in the theory and practice of statistics are the mid-distribution function $F^{mid}(y)$, defined by

$$F^{mid}(y) = F(y) - .5\,p(y),$$

and the mid-quantile function defined by

$$Q^{mid}(u) = Q(u) + .5\,sp(u).$$

Note that $F(y)$ is right continuous, and $Q(u)$ is left continuous. Note that $F^{mid}$ and $Q^{mid}$ are not inverses of each other.

We recommend as the definition of the probability integral transformation (or rank transform) of a continuous or discrete random variable Y

$$W = F^{mid}(Y);$$

it has mean $E[W] = .5$ and variance

$$\mathrm{VAR}[W] = (1/12)\Bigl(1 - \sum_{\text{all } y} p^3(y)\Bigr).$$

It is easy to verify this for a continuous Y (then W is Uniform[0,1]) and for a Bernoulli random variable Y taking values 0 and 1 with probabilities $q$ and $p$; $F^{mid}(0) = .5q$ and $F^{mid}(1) = 1 - .5p$.
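The mean and variance of W can be verified directly for any finite discrete distribution. The sketch below does so for a Bernoulli variable and for a fair die. The function name `mid_transform_moments` and the two example distributions are hypothetical choices made for illustration.

```python
def mid_transform_moments(pmf):
    """pmf: dict mapping value -> probability (assumed to sum to 1).

    Returns (mean of W, variance of W, (1/12)(1 - sum of p^3)),
    where W = Fmid(Y) and Fmid(y) = F(y) - .5 p(y).
    """
    cum = 0.0
    mean_w = 0.0
    second_moment = 0.0
    for value in sorted(pmf):
        p = pmf[value]
        cum += p                      # F(value)
        w = cum - 0.5 * p             # Fmid(value)
        mean_w += p * w
        second_moment += p * w * w
    var_w = second_moment - mean_w ** 2
    formula = (1.0 - sum(p ** 3 for p in pmf.values())) / 12.0
    return mean_w, var_w, formula

bernoulli = mid_transform_moments({0: 0.7, 1: 0.3})          # q = .7, p = .3
die = mid_transform_moments({k: 1 / 6 for k in range(1, 7)})
```

For the Bernoulli case the variance reduces to pq/4, since W takes two values that are .5 apart.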
4. Sample distribution and quantile: $\tilde F$, $\tilde Q$

Our approach to data analysis of a data set $Y_1, \ldots, Y_n$ involves defining sample versions of F and Q. The initial way to represent a sample is to form its sample distribution function

$$\tilde F(y) = \text{fraction of sample} \le y$$

and its sample quantile function

$$\tilde Q(u) = \tilde F^{-1}(u).$$

Explicit formulas for $\tilde F$ and $\tilde Q$ are expressed in terms of the distinct values in the sample, denoted $v_1 < \ldots < v_k$, their relative frequencies

$$\tilde p_j = \text{fraction of sample equal to } v_j,$$

and their cumulative relative frequencies

$$\tilde u_j = \tilde p_1 + \cdots + \tilde p_j, \quad j = 1, \ldots, k.$$

Define $\tilde u_0 = 0$, $v_0 = -\infty$, $v_{k+1} = \infty$.

The sample distribution function is discrete (piecewise constant) satisfying (for $j = 0, 1, \ldots, k$)

$$\tilde F(y) = \tilde u_j \quad \text{for } v_j \le y < v_{j+1}.$$

The sample quantile function is discrete (piecewise constant) satisfying (for $j = 1, \ldots, k$)

$$\tilde Q(u) = v_j \quad \text{for } \tilde u_{j-1} < u \le \tilde u_j.$$

We summarize these formulas by saying $\tilde F$ and $\tilde Q$ are piecewise constant between their values

$$\tilde F(v_j) = \tilde u_j, \quad \tilde Q(\tilde u_j) = v_j.$$
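The piecewise constant sample distribution and quantile functions can be sketched as follows. The function name `sample_functions` and the four-point sample are hypothetical; the code only assumes the definitions above (distinct values, relative frequencies, cumulative relative frequencies).

```python
from bisect import bisect_left, bisect_right
from collections import Counter

def sample_functions(sample):
    """Return distinct values v, cumulative relative frequencies u,
    and the step functions F (right continuous) and Q (left continuous)."""
    n = len(sample)
    counts = Counter(sample)
    v = sorted(counts)                   # distinct values v_1 < ... < v_k
    u = []                               # u_1, ..., u_k
    running = 0
    for value in v:
        running += counts[value]
        u.append(running / n)

    def F(y):
        j = bisect_right(v, y)           # number of distinct values <= y
        return u[j - 1] if j > 0 else 0.0

    def Q(prob):
        if not 0.0 < prob <= 1.0:
            raise ValueError("Q is defined here for 0 < u <= 1")
        j = bisect_left(u, prob)         # first index with u_j >= prob
        return v[min(j, len(v) - 1)]     # guard against rounding in u_k

    return v, u, F, Q

v, u, F, Q = sample_functions([3, 1, 2, 2])
```

With this sample, F jumps at 1, 2, 3 to the heights .25, .75, 1, and Q takes the value 2 on the interval .25 < u <= .75.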


5.  Comparison  distribution  and  comparison  density  for  continuous  F,  G 
Let $F_\theta$ be a specified continuous distribution (satisfying $Q_\theta F_\theta(y) = y$), which is a model for F, the unknown true distribution function of a continuous random variable Y. An important conceptual tool in statistical data analysis is transforming Y to the random variable $W = F_\theta(Y)$ which has quantile function

$$Q_W(u) = F_\theta(Q_Y(u))$$

and distribution function

$$F_W(u) = F_Y(Q_\theta(u)).$$

How does one benefit by transforming probability law estimation problems to probability law estimation for a variable W on the unit interval? One could form an estimator of the probability density of Y from an estimator of the probability density of W by sample analogues of the formulas relating their probability densities

$$f_W(u) = f_Y(Q_\theta(u))/f_\theta(Q_\theta(u)),$$

$$f_Y(y) = f_\theta(y)\,f_W(F_\theta(y)).$$

Estimation of $f_W(u)$ provides estimation of $f_Y(y)$. One could form an estimator of the quantile and quantile density function of Y by sample analogues of the formulas

$$Q_Y(u) = Q_\theta(Q_W(u)),$$

$$q_Y(u) = q_\theta(Q_W(u))\,q_W(u),$$

which require estimation of $Q_W(u)$ and $q_W(u)$. In our view the conceptual importance of the transformation to W comes from interpreting $q_W$ as a comparison density function $d(u; F_Y, F_\theta)$.

To two distribution functions F and G we associate a comparison distribution function on $0 < u < 1$, denoted $D(u) = D(u; G, F)$. We consider three cases: both continuous; both discrete; one continuous and one discrete.

When F and G are continuous we define

$$D(u; G, F) = F(G^{-1}(u))$$

with comparison density function $d(u) = D'(u)$ given by

$$d(u) = d(u; G, F) = f(G^{-1}(u))/g(G^{-1}(u)).$$

We assume that $f(y) > 0$ implies $g(y) > 0$. Then $D(0) = 0$, $D(1) = 1$. We call F and G equivalent if $f(y) > 0$ if and only if $g(y) > 0$.

In terms of comparison distribution functions we express the quantile and distribution functions of $W = F_\theta(Y)$:

$$Q_W(u) = D(u; F_Y, F_\theta) = D(u; \text{data}, \text{model}),$$

$$F_W(u) = D(u; F_\theta, F_Y) = D(u; \text{model}, \text{data}).$$
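As a numerical illustration, the following sketch evaluates $D(u; G, F) = F(G^{-1}(u))$ and $d(u; G, F) = f(G^{-1}(u))/g(G^{-1}(u))$ for two Normal distributions and checks that the integral of d reproduces D. The helper name `comparison`, the particular means and standard deviations, and the midpoint-rule check are assumptions made for illustration only.

```python
from statistics import NormalDist

def comparison(G, F):
    """Return the functions D(u; G, F) and d(u; G, F) for continuous G and F."""
    def D(u):
        return F.cdf(G.inv_cdf(u))
    def d(u):
        y = G.inv_cdf(u)
        return F.pdf(y) / G.pdf(y)
    return D, d

model = NormalDist(0.0, 1.0)   # plays the role of the model F_theta (assumed)
truth = NormalDist(0.5, 0.8)   # plays the role of the true F_Y (assumed)

# F_W(u) = D(u; model, data) and its density f_W(u) = d(u; model, data)
D, d = comparison(model, truth)

# Midpoint rule on (0, u0): the integral of d should agree with D(u0).
m, u0 = 20000, 0.8
integral = sum(d((i + 0.5) * u0 / m) for i in range(m)) * u0 / m
value_at_median = D(0.5)       # equals truth.cdf(0), since model.inv_cdf(.5) = 0
```

If the data distribution equalled the model, d would be identically 1 and D would be the identity function on the unit interval.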


6.  Comparison  distribution  and  comparison  density  for  discrete  F,G. 

When F and G are discrete we assume that their respective probability mass functions $p_F(y)$ and $p_G(y)$ satisfy

$$p_F(y) > 0 \text{ implies } p_G(y) > 0.$$

We call F and G equivalent if $p_F(y) > 0$ if and only if $p_G(y) > 0$. In the discrete case we define first the comparison density

$$d(u) = d(u; G, F) = p_F(G^{-1}(u))/p_G(G^{-1}(u))$$

and define its integral $D(u) = D(u; G, F)$ by

$$D(u) = D(u; G, F) = \int_0^u d(t; G, F)\,dt.$$

Our assumptions guarantee that $D(1) = 1$.

Analogues of this definition will be given in section 11 for F continuous and G discrete, based on the following characterization of $D(u)$: it is a P-P plot obtained by joining linearly the points

$$(0, 0); \quad (G(v_j), F(v_j)),\ j = 1, \ldots, k; \quad (1, 1)$$

where $v_1 < \ldots < v_k$ are the distinct values at which G jumps (which we have assumed to include all values at which F jumps).
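The P-P plot characterization can be sketched for two small discrete distributions. The function name `discrete_comparison` and the example probability mass functions are hypothetical; the code assumes only that $p_F(y) > 0$ implies $p_G(y) > 0$.

```python
def discrete_comparison(p_G, p_F):
    """p_G, p_F: dicts mapping value -> probability.

    Returns the P-P plot vertices (G(v_j), F(v_j)) and the heights of the
    comparison density d(u; G, F) on each interval G(v_{j-1}) < u <= G(v_j).
    """
    if any(p > 0 and p_G.get(v, 0.0) <= 0 for v, p in p_F.items()):
        raise ValueError("F puts mass where G has none")
    support = sorted(v for v, p in p_G.items() if p > 0)
    points = [(0.0, 0.0)]
    heights = []
    g_cum = f_cum = 0.0
    for v in support:
        g_cum += p_G[v]
        f_cum += p_F.get(v, 0.0)
        heights.append(p_F.get(v, 0.0) / p_G[v])   # p_F(v) / p_G(v)
        points.append((g_cum, f_cum))
    return points, heights

points, heights = discrete_comparison(
    {0: 0.25, 1: 0.5, 2: 0.25},   # G: Binomial(2, .5), an arbitrary illustration
    {0: 0.5, 1: 0.5},             # F: puts no mass at the value 2
)
```

The slope of each linear segment of the P-P plot equals the corresponding density height, so here the segments have slopes 2, 1, and 0.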

We call our approach to data analysis “functional” because it emphasizes forming and smoothing functions on the interval $0 < u < 1$; raw estimators $\tilde d(u; F_Y, F_\theta)$ and $\tilde d(u; F_\theta, F_Y)$ are smoothed to form estimators $\hat d(u; F_Y, F_\theta)$ and $\hat d(u; F_\theta, F_Y)$. The graphs it provides for graphical data analysis are pictures of functions.

7.  Information  measures:  Renyi,  Chi-square 

Comparison densities provide insight into information methods because information measures of univariate distributions can be expressed in terms of $d(u; F, G)$. Information measures play a central role in statistical data analysis because they provide tools to measure the “distance” between two probability distributions F and G. The (Kullback-Leibler) information divergence is defined (Kullback (1959)) by (our definitions differ from usual definitions by a factor of 2)

$$I(F; G) = (-2) \int_{-\infty}^{\infty} \log\{g(x)/f(x)\}\,f(x)\,dx$$

when F and G are continuous with probability density functions $f(x)$ and $g(x)$. When F and G are discrete, with probability mass functions $p_F(x)$ and $p_G(x)$, information divergence has an analogous definition:

$$I(F; G) = (-2) \sum \log\{p_G(x)/p_F(x)\}\,p_F(x).$$

An information decomposition of information divergence is

$$I(F; G) = H(F; G) - H(F),$$

in terms of entropy $H(F)$ and cross-entropy $H(F; G)$:

$$H(F) = (-2) \int_{-\infty}^{\infty} \{\log f(x)\}\,f(x)\,dx,$$

$$H(F; G) = (-2) \int_{-\infty}^{\infty} \{\log g(x)\}\,f(x)\,dx.$$

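Adapting the fundamental work of Renyi (1961) we define Renyi information of index $\lambda$. For continuous F and G: for $\lambda \ne 0, -1$

$$IR_\lambda(F; G) = \{2/\lambda(1+\lambda)\} \log \int \{g(y)/f(y)\}^{1+\lambda} f(y)\,dy$$

$$= \{2/\lambda(1+\lambda)\} \log \int \bigl[\{g(y)/f(y)\}^{1+\lambda} - (1+\lambda)\{g(y)/f(y) - 1\}\bigr] f(y)\,dy,$$

$$IR_0(F; G) = 2 \int \{g(y)/f(y)\} \log\{g(y)/f(y)\}\,f(y)\,dy,$$

$$IR_{-1}(F; G) = -2 \int \log\{g(y)/f(y)\}\,f(y)\,dy.$$

An analogous definition holds for discrete F and G.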

The second definition provides: (1) extensions to non-negative functions which are not densities, and also (2) a non-negative integrand which can provide diagnostic measures at each value of y.

Renyi information, for $-1 < \lambda < 0$, is equivalent to Bhattacharyya distance. Hellinger distance corresponds to $\lambda = -.5$.

In addition to Renyi information divergence (an extension of information statistics) one uses as information divergence between two non-negative functions an extension of chi-square statistics which has been developed by Read and Cressie (1988). For $\lambda \ne 0, -1$, Chi-square divergence of index $\lambda$ is defined for continuous F and G by

$$C_\lambda(F; G) = \int B_\lambda(g(y)/f(y))\,f(y)\,dy$$

where

$$B_\lambda(d) = \{2/\lambda(1+\lambda)\}\{d^{1+\lambda} - (1+\lambda)(d - 1) - 1\},$$

$$B_0(d) = 2\{d \log d - d + 1\},$$

$$B_{-1}(d) = -2\{\log d - d + 1\}.$$
Important properties of $B_\lambda(d)$ are:

$$B_\lambda(d) \ge 0, \quad B_\lambda(1) = B'_\lambda(1) = 0,$$

$$B'_\lambda(d) = (2/\lambda)(d^\lambda - 1), \quad B''_\lambda(d) = 2d^{\lambda - 1},$$

$$B_1(d) = (d - 1)^2,$$

$$B_0(d) = 2(d \log d - d + 1),$$

$$B_{-.5}(d) = 4(d^{.5} - 1)^2,$$

$$B_{-1}(d) = -2(\log d - d + 1),$$

$$B_{-2}(d) = d(d^{-1} - 1)^2.$$

An analogous definition holds for discrete F and G. Axiomatic derivations of information measures similar to $C_\lambda$ are given by Jones and Byrne (1990).
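The listed special cases can be checked numerically. The sketch below assumes the general form $B_\lambda(d) = \{2/\lambda(1+\lambda)\}\{d^{1+\lambda} - (1+\lambda)(d-1) - 1\}$, which is the form determined by the properties $B_\lambda(1) = B'_\lambda(1) = 0$ and $B''_\lambda(d) = 2d^{\lambda-1}$. The function name `B` and the test value d = 1.7 are arbitrary illustrative choices.

```python
import math

def B(lam, d):
    """B_lambda(d) for d > 0, with the limiting forms at lambda = 0 and lambda = -1."""
    if lam == 0:
        return 2.0 * (d * math.log(d) - d + 1.0)
    if lam == -1:
        return -2.0 * (math.log(d) - d + 1.0)
    scale = 2.0 / (lam * (1.0 + lam))
    return scale * (d ** (1.0 + lam) - (1.0 + lam) * (d - 1.0) - 1.0)

d = 1.7
checks = {
    "lambda = 1": (B(1, d), (d - 1.0) ** 2),
    "lambda = -.5": (B(-0.5, d), 4.0 * (math.sqrt(d) - 1.0) ** 2),
    "lambda = -2": (B(-2, d), d * (1.0 / d - 1.0) ** 2),
    "lambda near 0": (B(1e-4, d), B(0, d)),        # continuity at lambda = 0
    "lambda near -1": (B(-1 + 1e-4, d), B(-1, d)), # continuity at lambda = -1
}
```

Each pair in `checks` holds the general formula next to the closed form it should match.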

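The Renyi information and chi-square divergence measures are related:

$$IR_0(F; G) = C_0(F; G),$$

$$IR_{-1}(F; G) = C_{-1}(F; G).$$

For $\lambda \ne 0, -1$,

$$IR_\lambda(F; G) = \{2/\lambda(1+\lambda)\} \log\bigl\{1 + (\lambda(1+\lambda)/2)\,C_\lambda(F; G)\bigr\}.$$

Interchange of F and G is provided by the Lemma: when F and G are equivalent,

$$C_\lambda(F; G) = C_{-(1+\lambda)}(G; F),$$

$$IR_\lambda(F; G) = IR_{-(1+\lambda)}(G; F).$$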

Example: Renyi information divergence of two zero mean univariate normal distributions. Let $P_j$ be the distribution on the real line corresponding to Normal$(0, K_j)$ with variance $K_j$. Let $\kappa = K_2/K_1$. Then

$$d(u; P_2, P_1) = \kappa^{.5} \exp\{-.5(\kappa - 1)|\Phi^{-1}(u)|^2\},$$

$$IR_{-1}(P_2; P_1) = \kappa - 1 - \log\kappa,$$

$$IR_\lambda(P_2; P_1) = (1/\lambda)\bigl\{\log\kappa - (1+\lambda)^{-1} \log\{1 + (1+\lambda)(\kappa - 1)\}_+\bigr\},$$

$$C_\lambda(P_2; P_1) = \{2/\lambda(1+\lambda)\}\bigl[\kappa^{.5(1+\lambda)}\{1 + (1+\lambda)(\kappa - 1)\}^{-.5} - 1\bigr].$$
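The closed forms for two zero mean Normal distributions can be compared with direct numerical integration over $0 < u < 1$. This sketch takes $\kappa = K_2/K_1$ and $d(u; P_2, P_1) = p_1(P_2^{-1}(u))/p_2(P_2^{-1}(u))$. The variances 1 and 2.5, the index $\lambda = -.5$, and the midpoint rule are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

K1, K2 = 1.0, 2.5                          # assumed variances
P1 = NormalDist(0.0, math.sqrt(K1))
P2 = NormalDist(0.0, math.sqrt(K2))
kappa = K2 / K1

def d(u):
    """Comparison density d(u; P2, P1) = p1(y)/p2(y) at y = P2^{-1}(u)."""
    y = P2.inv_cdf(u)
    return P1.pdf(y) / P2.pdf(y)

m = 100000                                 # midpoint rule on the unit interval
dens = [d((i + 0.5) / m) for i in range(m)]

# IR_{-1}: -2 times the integral of log d(u)
ir_minus1_numeric = -2.0 * sum(math.log(x) for x in dens) / m
ir_minus1_closed = kappa - 1.0 - math.log(kappa)

# IR_lambda: {2/lambda(1+lambda)} times the log of the integral of d^(1+lambda)
lam = -0.5
integral = sum(x ** (1.0 + lam) for x in dens) / m
ir_lam_numeric = (2.0 / (lam * (1.0 + lam))) * math.log(integral)
ir_lam_closed = (1.0 / lam) * (
    math.log(kappa) - math.log(1.0 + (1.0 + lam) * (kappa - 1.0)) / (1.0 + lam)
)
```

Both divergences vanish when $\kappa = 1$, that is, when the two variances agree.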


8.  Information  for  comparison  density  functions 

Information divergence $I(F; G)$ is a concept that works for both multivariate and univariate distributions. In the univariate case we are able to relate $I(F; G)$ to the concept of comparison density $d(u; F, G)$.

For a density $d(u)$, $0 < u < 1$, Renyi information (of index $\lambda$), denoted $IR_\lambda(d)$, is non-negative and measures the divergence of $d(u)$ from uniform density $d_0(u) = 1$, $0 < u < 1$. It is defined:

$$IR_0(d) = 2 \int_0^1 \{d(u) \log d(u)\}\,du = 2 \int_0^1 \{d(u) \log d(u) - d(u) + 1\}\,du,$$

$$IR_{-1}(d) = -2 \int_0^1 \{\log d(u)\}\,du = -2 \int_0^1 \{\log d(u) - d(u) + 1\}\,du;$$

for $\lambda \ne 0$ or $-1$

$$IR_\lambda(d) = \{2/\lambda(1+\lambda)\} \log \int_0^1 \{d(u)\}^{1+\lambda}\,du$$

$$= \{2/\lambda(1+\lambda)\} \log \int_0^1 \bigl(\{d(u)\}^{1+\lambda} - (1+\lambda)\{d(u) - 1\}\bigr)\,du.$$

To relate comparison density to information divergence we use the concept of Renyi information $IR_\lambda$, which yields the important identity (and interpretation of $I(F; G)$)

$$I(F; G) = (-2) \int_0^1 \log d(u; F, G)\,du = IR_{-1}(d(u; F, G)) = IR_0(d(u; G, F)).$$

For a density $d(u)$, $0 < u < 1$, define

$$C_\lambda(d) = \int_0^1 B_\lambda(d(u))\,du.$$

The comparison density again unifies the continuous and discrete cases. One can show that for univariate F and G

$$C_\lambda(F; G) = C_\lambda(d(u; F, G)).$$


For a random sample of a random variable with unknown probability density $f$, maximum likelihood estimators $\hat\theta$ of the parameters of a finite parameter model $f_\theta$ of the probability density $f$ can be shown to be equivalent to minimizing

$$I(\tilde f; f_\theta) = IR_{-1}(d(u; \tilde F, F_\theta))$$

where $\tilde f$ is a raw estimator of $f$ (initially, a symbolic sample probability density formed from the sample distribution function $\tilde F$).

9.  Information  measures  and  entropy  tests  of  fit 

To test the goodness of fit hypothesis for a continuous random variable Y

$$H_0: F_Y(y) = F_\theta(y),$$

many statistics have been proposed which start with the probability integral transformation

$$W = F_\theta(Y)$$

for which the goodness of fit hypothesis is $H_0$: W is Uniform[0,1].

An entropy test of fit is Moran’s statistic, which transforms $Y_1, \ldots, Y_n$ to $W_1, \ldots, W_n$ and forms the order statistics $W(1; n) < \ldots < W(n; n)$ with spacings

$$S_j(\theta) = W(j; n) - W(j - 1; n), \quad j = 1, \ldots, n + 1,$$

defining $W(0; n) = 0$, $W(n + 1; n) = 1$. Moran’s statistic is often defined as

$$\sum_{j=1}^{n+1} -\log S_j(\theta).$$

We prefer to define it as

$$\tilde M(\theta) = (n + 1)^{-1} \sum_{j=1}^{n+1} (-2) \log\{(n + 1) S_j(\theta)\}.$$

Under the null hypothesis, it is asymptotically normal with mean $2\gamma$ ($\gamma = .57722$, Euler’s constant), and variance

$$4(n + 1)^{-1}((\pi^2/6) - 1).$$

Small sample asymptotic chi-square and beta distributions (given by Smethurst and Mudholkar (1991)) are more appropriate for an entropy interpretation.
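The normalized form of Moran's statistic, $(n+1)^{-1}$ times the sum of $-2\log\{(n+1)S_j\}$ over the $n+1$ spacings, is straightforward to compute. The sketch below evaluates it for a simulated Uniform[0,1] sample and standardizes it with the asymptotic null mean and variance quoted above. The function name `moran_statistic`, the sample size, and the seed are illustrative assumptions, and the code assumes no tied observations (a zero spacing would make the logarithm undefined).

```python
import math
import random

def moran_statistic(w):
    """Average of (-2) log{(n+1) S_j} over the n+1 spacings of values w in (0, 1)."""
    n = len(w)
    ordered = [0.0] + sorted(w) + [1.0]        # W(0;n) = 0 and W(n+1;n) = 1
    spacings = [ordered[j] - ordered[j - 1] for j in range(1, n + 2)]
    return sum(-2.0 * math.log((n + 1) * s) for s in spacings) / (n + 1)

n = 2000
rng = random.Random(1)                          # fixed seed, illustration only
m_uniform = moran_statistic([rng.random() for _ in range(n)])

euler_gamma = 0.57722
null_mean = 2.0 * euler_gamma
null_sd = math.sqrt(4.0 * (math.pi ** 2 / 6.0 - 1.0) / (n + 1))
z_score = (m_uniform - null_mean) / null_sd     # roughly standard Normal under H0

# Perfectly regular spacings give (n+1) S_j = 1 for every j, so the statistic is 0.
m_equal = moran_statistic([j / (n + 1) for j in range(1, n + 1)])
```

Values of the statistic well above $2\gamma$ indicate spacings that are more irregular than those of a uniform sample.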

In order to understand and extend $\tilde M(\theta)$, we regard it as an estimator of a quantity $M(\theta)$ defined by probability theory. The original observation Y has true distribution function $F_Y$ and quantile function $Q_Y(u) = F_Y^{-1}(u)$. The transformation $W = F_\theta(Y)$ has quantile function $Q_W(u) = F_\theta(Q_Y(u))$ and distribution function $F_W(u) = F_Y(Q_\theta(u))$; W has quantile density $q_W(u) = f_\theta(Q_Y(u))/f_YQ_Y(u)$, and entropy

$$H(W) = \int_0^1 (-2 \log f_W(w))\,f_W(w)\,dw = \int_0^1 2 \log q_W(u)\,du.$$

Under the null hypothesis $H_0: F_Y = F_\theta$, $q_W(u)$ is identically 1, and $H(W) = 0$.

Note $-H(W)$, the neg-entropy of W, is non-negative and is the population parameter, denoted $M(\theta)$, which is being non-parametrically estimated by Moran’s statistic $\tilde M(\theta)$.

How do we benefit from estimating entropy (or neg-entropy) of W? It provides tests of $H_0$ and can provide (through suitable analogues of Akaike Information Criterion) insight about selection of alternative models to fit when one rejects $H_0$. Thus understanding and improving Moran’s statistic requires us to solve problems of density estimation, especially estimation of the smooth comparison density $d(u; F_Y, F_\theta)$ from raw estimators

$$\tilde d(u; F_Y, F_\theta) = d(u; \tilde F^c, F_\theta),$$

where $\tilde F^c$ is a continuous version of the discrete distribution function $\tilde F$.

Another important interpretation of $M(\theta)$ is $M(\theta) = I(f; f_\theta)$, the Kullback information divergence between the true $F(y)$ and the model $F_\theta(y)$.

10.  Continuous  versions  of  discrete  distribution  functions 

To compare a continuous and a discrete distribution we propose forming a continuous distribution function version of a discrete one. To estimate a continuous distribution function F from data we recommend first forming our continuous version $\tilde F^c$ of the discrete distribution function given by the sample distribution function $\tilde F$ formed from the data.

For discrete data we recommend estimating the continuous version $F^c$ of its discrete distribution function F. We conjecture that these recommendations provide a unified theory of discrete and continuous data analysis as well as improved methods of continuous data analysis.

To define the continuous version of a discrete distribution F, we assume that it can be described by a finite number of points $(v_j, u_j)$ such that

$$F(v_j) = u_j \quad \text{for } j = 1, \ldots, k.$$

Note that $u_k = 1$. Define $u_0 = 0$; then $0 = u_0 < u_1 < \ldots < u_k = 1$.

Its quantile function $Q(u)$ is discrete and satisfies, for $j = 1, \ldots, k$,

$$Q(u_j) = v_j.$$

Define “mid-values” $v^c_j$, $j = 0, \ldots, k$, by

$$v^c_0 = v_1, \quad v^c_k = v_k,$$

$$v^c_j = .5(v_j + v_{j+1}) \quad \text{for } j = 1, \ldots, k - 1.$$

Define $F^c$ and $Q^c$ to be piecewise linear between their values (for $j = 0, 1, \ldots, k$)

$$Q^c(u_j) = v^c_j,$$

$$F^c(v^c_j) = u_j.$$
We call $F^c$ and $Q^c$ continuous (piecewise linear) versions of the discrete (piecewise constant) functions F and Q.

It is interesting to compare the continuous version of a discrete distribution to its mid-distribution $F^{mid}$ whose definition we recall; define $p_j = u_j - u_{j-1}$ and

$$F^{mid}(v_j) = u_j - .5p_j \quad \text{for } j = 1, \ldots, k.$$

One conjectures that approximately (and exactly when $v_j$ are equi-spaced)

$$F^c(v_j) = u_j - .5p_j,$$

so that approximations to $F^c$ are also approximations to $F^{mid}$.
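The construction of the continuous version can be sketched as follows: mid-values $v^c_0 = v_1$, $v^c_j = .5(v_j + v_{j+1})$, $v^c_k = v_k$, and linear interpolation through the points $(v^c_j, u_j)$ with $u_0 = 0$. The function name `continuous_version` and the four-point example are hypothetical. With this particular construction the agreement with the mid-distribution is exact at the interior values $v_2, \ldots, v_{k-1}$ of an equi-spaced support; at the two extreme values the interpolant takes the values 0 and 1.

```python
def continuous_version(v, u):
    """v: distinct values v_1 < ... < v_k (k >= 2); u: cumulative probabilities, u_k = 1.

    Returns the mid-values v^c_0, ..., v^c_k and the piecewise linear F^c.
    """
    k = len(v)
    vc = [v[0]] + [0.5 * (v[j] + v[j + 1]) for j in range(k - 1)] + [v[-1]]
    uc = [0.0] + list(u)                   # u_0 = 0, so F^c(v^c_j) = u_j

    def Fc(y):
        if y <= vc[0]:
            return 0.0
        if y >= vc[-1]:
            return 1.0
        for j in range(1, k + 1):          # locate v^c_{j-1} < y <= v^c_j
            if y <= vc[j]:
                t = (y - vc[j - 1]) / (vc[j] - vc[j - 1])
                return uc[j - 1] + t * (uc[j] - uc[j - 1])

    return vc, Fc

values = [0, 1, 2, 3]                      # equi-spaced support (assumed example)
cum = [0.1, 0.5, 0.8, 1.0]                 # F(v_j); jumps p_j = .1, .4, .3, .2
vc, Fc = continuous_version(values, cum)
mid_at_1 = 0.5 - 0.5 * 0.4                 # Fmid(1) = u_2 - .5 p_2
mid_at_2 = 0.8 - 0.5 * 0.3                 # Fmid(2) = u_3 - .5 p_3
```

Here `Fc(1)` and `Fc(2)` agree with `mid_at_1` and `mid_at_2` up to rounding.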

To justify our view that these concepts are very natural, we would argue that the continuity correction when one approximates a discrete distribution by a continuous one (say the binomial by the normal) can be explained by regarding the limiting continuous distribution as approximating $F^c$ and $F^{mid}$ rather than F.

Let F be the distribution function of a Binomial(n, p) random variable. The continuity correction says that (for $x = 0, 1, \ldots, n$) approximately

$$F(x) = F^c(x + .5) = \Phi((x + .5 - np)/(np(1 - p))^{.5}).$$

Note that (for $x = 0, 1, \ldots, n$) approximately

$$F^c(x) = F^{mid}(x) = \Phi((x - np)/(np(1 - p))^{.5}).$$
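A quick numerical check of the two approximations for a Binomial distribution is sketched below. The values n = 50 and p = .4 are arbitrary illustrative assumptions; the code compares F(x) with the continuity-corrected Normal value, the mid-distribution with the uncorrected Normal value, and, for contrast, F(x) with the uncorrected Normal value.

```python
import math
from statistics import NormalDist

n, p = 50, 0.4                                   # assumed example values
sd = math.sqrt(n * p * (1 - p))
Phi = NormalDist().cdf

pmf = [math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
cdf, running = [], 0.0
for mass in pmf:
    running += mass
    cdf.append(running)                          # F(x)
mid = [F_x - 0.5 * mass for F_x, mass in zip(cdf, pmf)]   # Fmid(x) = F(x) - .5 p(x)

# Largest discrepancy over x = 0, ..., n for each pairing
err_F = max(abs(cdf[x] - Phi((x + 0.5 - n * p) / sd)) for x in range(n + 1))
err_mid = max(abs(mid[x] - Phi((x - n * p) / sd)) for x in range(n + 1))
err_uncorrected = max(abs(cdf[x] - Phi((x - n * p) / sd)) for x in range(n + 1))
```

The uncorrected comparison is off by roughly half the probability mass at the mode, which is the gap that the continuity correction removes.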


11.  Comparison  distributions  of  one  sample  (continuous  data) 

Given a sample $Y_1, \ldots, Y_n$ from a continuous distribution F, we recommend as the first step in data analysis to compute and plot $\tilde F^c$ and $\tilde Q^c$, the continuous versions of the discrete sample distribution and quantile functions. We regard $\tilde Q^c$ as a raw estimator of the true quantile Q which provides a minimal amount of smoothing of the observations.

The process of fitting a model to the data can be formulated in terms of a specified continuous distribution function $F_\theta$ (whose form may be guessed from a visual examination of $\tilde Q^c$, normalized at $u = .5$ to equal 0 and to have slope 1 (compare Parzen (1986))). We not only estimate F but also test the goodness of fit hypothesis $H_0: F = F_\theta$. To motivate our approach let us review some methods of testing a goodness of fit hypothesis $H_0$ for continuous data.

A graphical diagnostic of $H_0$ is the Q-Q plot, which looks for linearity in the graph of

$$(Q_\theta(\tilde F^{mid}(v_j)), v_j), \quad j = 1, \ldots, k;$$

an alternative is the Q-Q$^c$ plot which graphs the points

$$(Q_\theta(u_j), v^c_j) = (Q_\theta(\tilde F^c(v^c_j)), v^c_j), \quad j = 1, \ldots, k - 1.$$
Goodness of fit tests of $H_0$ have traditionally been expressed by statisticians as measures of $\tilde F(y) - F_\theta(y)$, the difference of distribution functions. Typical measures are non-linear functionals such as

$$D[\text{Kolmogorov-Smirnov}] = \sup_y |\tilde F(y) - F_\theta(y)|$$

and

$$\int_{-\infty}^{\infty} |\tilde F(y) - F_\theta(y)|^2\,dF_\theta(y).$$

Goodness of fit compares probabilities; we believe that probabilities $\tilde p$ and $\hat p$, representing data and model, should be compared not by their differences but by their ratio! In symbols, measure $(\tilde p/\hat p) - 1$ rather than $\tilde p - \hat p$. Therefore goodness of fit tests should be based on measures of the difference from the identity function $D_0(u) = u$, $0 < u < 1$, of comparison distribution functions. Goodness of fit tests for uniformity are traditionally based on

$$D(u; F_\theta, \tilde F) = \tilde F_W(u) = \tilde F(Q_\theta(u)), \quad 0 < u < 1,$$

the sample distribution function of the probability integral transformation $W = F_\theta(Y)$. Traditional maximum likelihood estimators of $\theta$ are chosen by the criterion that the sample quantile function

$$D(u; \tilde F, F_\theta) = \tilde Q_W(u) = F_\theta(\tilde Q(u)), \quad 0 < u < 1,$$

has minimum Kullback information distance from $D_0(u) = u$.

This paper proposes that we need to overcome the problem that $\tilde F$ and $\tilde Q$ are discrete and are not directly covered by our definitions of comparison distribution functions; we recommend that data analysis be based on the definitions below of continuous raw comparison functions, denoted $D^c(u; \tilde F, F_\theta)$ and $D^c(u; F_\theta, \tilde F)$. Analogously one defines comparison distribution functions, denoted $D^c(u; G, F)$ and $D^c(u; F, G)$ rather than $D(u; G, F)$ and $D(u; F, G)$, when F is continuous and G is discrete.

Recall $\tilde F(v_j) = u_j$ for $j = 1, \ldots, k$. Let $v^c_j$, $j = 0, \ldots, k$, be mid values. Define for $j = 1, \ldots, k - 1$

$$w_j = F_\theta(\tilde Q^c(u_j)) = F_\theta(v^c_j).$$

Assume $0 < w_1 < \ldots < w_{k-1} < 1$.

Define $\tilde Q^c_W(u)$ as a piecewise linear curve connecting the values $(0, 0)$, $(u_j, w_j)$ for $j = 1, \ldots, k - 1$, $(1, 1)$.

Define $\tilde F^c_W(u)$ as a piecewise linear curve connecting the values $(0, 0)$, $(w_j, u_j)$ for $j = 1, \ldots, k - 1$, $(1, 1)$.

The derivatives, denoted $\tilde q^c_W(u)$ and $\tilde f^c_W(u)$ respectively, are sample quantile density and probability density functions. Define $d^c(u; \tilde F, F_\theta) = \tilde q^c_W(u)$, $d^c(u; F_\theta, \tilde F) = \tilde f^c_W(u)$.
One smooths raw densities to form smooth estimators, denoted

$$\hat Q_W(u), \quad \hat F_W(u).$$

The adequacy of the smoothing can be judged by comparing on one graph $\tilde Q^c_W$ and $\hat Q_W$, and comparing on one graph $\tilde F^c_W$ and $\hat F_W$. In this way one can develop $\hat f_Y(y)$ and $\hat Q_Y(u)$.

12.  One  sample  parameter  estimation. 

Regular maximum likelihood estimators $\hat\theta$ are parameter values minimizing

$$\int_0^1 -\log f_\theta(\tilde Q(u))\,du$$

or equivalently minimizing the negative of the average log likelihood

$$-L(\theta) = (1/n) \sum_{j=1}^{n} -\log f_\theta(Y(j; n)).$$

A maximum spacings estimator, denoted $\theta'$, can be obtained (compare Ranneby (1984)) by minimizing (with respect to all possible parameter values $\theta$) the neg-entropy

$$2 \int_0^1 -\log d^c(u; \tilde F, F_\theta)\,du$$

or equivalently minimizing

$$-2 \sum_{j=1}^{k} (u_j - u_{j-1}) \log\bigl\{(F_\theta(\tilde Q^c(u_j)) - F_\theta(\tilde Q^c(u_{j-1})))/(u_j - u_{j-1})\bigr\}.$$

In this expression, the logarithm is taken after integration rather than before; consequently it provides estimators in non-regular cases.

Minimum information estimators (more precisely, minimum Renyi information of index $\lambda$ estimators) $\tilde\theta$ minimize (for $-1 < \lambda < 0$, and especially $\lambda = -.5$ (compare Beran (1977)))

$$IR_\lambda(d^c(u; \tilde F, F_\theta)).$$

Regular maximum likelihood estimators (which correspond to $\lambda = -1$) satisfy the estimating equations

$$\int_0^1 S_{\theta_i}(\tilde Q^c(u))\,du = 0$$

where

$$S_{\theta_i}(y) = (\partial/\partial\theta_i) \log f_\theta(y)$$

is the score function of component $\theta_i$ of the vector parameter $\theta$. Minimum information estimators satisfy the estimating equations

$$\int_0^1 \{d^c(u; \tilde F, F_\theta)\}^{1+\lambda}\,S_{\theta_i}(\tilde Q^c(u))\,du = 0.$$

Minimum information estimators provide robust estimators.

To  test  if  robust  minimum  information  estimators  of  a  given  data  set  are  to 
be  preferred  to  regular  maximum  likelihood  estimators,  one  could  test  if  the  latter 
satisfy  the  estimating  equations  of  the  former.  The  theory  and  practice  of  this 
“test  for  robustness”  are  open  research  problems. 

13.  Comparison  distributions  of  two  samples  (continuous  data) 

A central problem of statistics is to test

$$H_0: F_1 = F_2,$$

the equality of two continuous distribution functions $F_1$ and $F_2$. The data are assumed to be independent observations of a first sample

$X_1, \ldots, X_{n_1}$ assumed to be distributed as $F_1$,

and a second sample

$Y_1, \ldots, Y_{n_2}$ assumed to be distributed as $F_2$.

Let $\tilde F_1$ and $\tilde F_2$ denote the sample distribution functions of the two samples. The pooled sample consists of all X and Y values; its sample distribution function is denoted $\tilde F$. Let $n = n_1 + n_2$ be the total sample size. Let $p_j = n_j/n$ be the fraction of the pooled sample in the j-th sample. One can represent

$$\tilde F = p_1\tilde F_1 + p_2\tilde F_2;$$

it is an estimator of the true pooled distribution

$$F = p_1F_1 + p_2F_2.$$

The novelty of our approach to testing $H_0$ is our proposed comparison distribution function

$$\tilde D(u) = D(u; \tilde F, \tilde F_1)$$

which estimates $D(u) = D(u; F, F_1)$. Because $\tilde F$ and $\tilde F_1$ are both discrete, the comparison distribution $\tilde D(u)$ is defined in terms of the comparison density function

$$\tilde d(u) = d(u; \tilde F, \tilde F_1).$$

The asymptotic distribution of $\tilde D(u)$ as an estimator of $D(u) = D(u; F, F_1)$ can be shown to be the same as the Pyke-Shorack two sample process. The mathematical statistics is the same, but the data analysis is greatly improved.

Linear rank statistics to test $H_0$ can be represented as linear functionals

$$\langle J(u), \tilde d(u) \rangle = \int_0^1 J(u)\,\tilde d(u)\,du$$

for suitable score functions $J(u)$. The Wilcoxon statistic (which tests for a shift in location) corresponds to $J(u) = u$, whose orthonormal version is $J(u) = 12^{.5}(u - .5)$.
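To make the representation concrete, the sketch below builds the raw two-sample comparison density from the pooled ordering and integrates it against the orthonormal Wilcoxon score. It assumes no ties, in which case the raw density equals $n/n_1$ on the interval $(j-1)/n < u \le j/n$ when the j-th smallest pooled observation comes from the first sample, and 0 otherwise. All function names and the tiny data sets are hypothetical.

```python
import math

def raw_comparison_density(x, y):
    """Heights of d(u; pooled, first sample) on (j-1)/n < u <= j/n, j = 1..n (no ties)."""
    n1, n = len(x), len(x) + len(y)
    pooled = sorted([(value, 1) for value in x] + [(value, 2) for value in y])
    return [n / n1 if label == 1 else 0.0 for _, label in pooled]

def linear_rank_statistic(heights, score_integral):
    """Integral over (0, 1) of J(u) times the raw density.

    score_integral(a, b) must return the integral of J over (a, b).
    """
    n = len(heights)
    return sum(h * score_integral((j - 1) / n, j / n)
               for j, h in enumerate(heights, start=1))

def wilcoxon_score_integral(a, b):
    # Integral of 12^{.5} (u - .5) over (a, b)
    return math.sqrt(12.0) * (0.5 * (b * b - a * a) - 0.5 * (b - a))

# First sample entirely below the second: a strongly negative statistic.
separated = linear_rank_statistic(
    raw_comparison_density([1, 2], [3, 4]), wilcoxon_score_integral)
# First sample symmetric within the pooled ordering: the statistic is zero.
balanced = linear_rank_statistic(
    raw_comparison_density([1, 4], [2, 3]), wilcoxon_score_integral)
total_mass = sum(raw_comparison_density([1, 4], [2, 3])) / 4   # density integrates to 1
```

Other linear rank statistics follow by swapping in the integral of a different score function J.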

14.  Two  sample  comparison  density  estimation. 

We propose that the best summary of two sample data analysis is not a p value of a linear rank statistic but a smooth density estimator $\hat d(u)$ of the true comparison density $d(u) = d(u; F, F_1)$.

The asymptotic distribution of estimators $\hat d(u)$ which smooth $\tilde d(u)$ is best understood by normalizing it to be between 0 and 1 by defining

$$\tilde p(u) = p_1\tilde d(u)$$

whose smooth estimators are denoted $\hat p(u)$. Their asymptotic distribution theory (outlined below) is developed on the assumption that they are estimators of a true comparison density $d(u) = d(u; F, F_1)$; let $p(u) = p_1d(u)$. The asymptotic variance of $\hat p(u)$ can be shown to be proportional to $p(u)(1 - p(u))$, which means that its asymptotic distribution is similar to that of an estimator $\hat p$ of a probability $p$.

In contrast the asymptotic variance in the one sample case of the smooth quantile density estimator $\hat q$ is proportional to $q^2$, and the asymptotic variance of the smooth probability density estimator $\hat f$ is proportional to $f$ if $\hat f$ is a standard kernel estimator, and is proportional to $f^2$ if $\hat f$ is a nearest neighbor estimator.

One of the joys of a unified framework for one sample and two sample data analysis is that it can comprehend and explain the different qualitative behavior of estimators of different types of densities. Parzen (1983) states, and outlines proofs of, the following results about comparison density estimation.

A kernel comparison density estimator has the form

$$\hat d(u) = \int_0^1 K_M(u - t)\,\tilde d(t)\,dt$$

where

$$K_M(t) = \sum_{v=-\infty}^{\infty} e^{-2\pi i t v}\,k(v/M).$$

We call M truncation point (or effective number of parameters); it is chosen as a function of n and tends to $\infty$, as n tends to $\infty$, at a suitable rate. Let

$$K^* = \int_{-\infty}^{\infty} K^2(t)\,dt = \int_{-\infty}^{\infty} k^2(x)\,dx.$$

One can show that (by letting M tend to $\infty$ at a suitable rate)

$$\lim_{n\to\infty} (n/M)\,\mathrm{Var}[\hat p(u)] = K^*\,p(u)(1 - p(u)).$$
The numerical derivative density estimator

$$\hat d(u) = (2h)^{-1}\{\tilde D(u + h) - \tilde D(u - h)\}$$

corresponds to $M = 1/h$ and $k(x) = (\sin 2\pi x)/2\pi x$; it has asymptotic variance

$$\lim_{n\to\infty} (n/M)\,\mathrm{Var}[\hat p(u)] = .5\,p(u)(1 - p(u)).$$

Evaluate $K^*$ from $K(t) = .5$ for $|t| \le 1$, 0 otherwise.

An autoregressive (of order m) estimator has asymptotic variance

$$\lim_{n\to\infty} (n/m)\,\mathrm{Var}[\hat p(u)] = 2\,p(u)(1 - p(u)).$$

Model order selection techniques can be developed by adapting Akaike (1973), Atilgan and Bozdogan (1990), and Sakamoto, Ishiguro, Kitagawa (1983).

To obtain one sample probability density estimator results from the two sample results, let the first sample have unknown distribution $F_1$, the second sample have known distribution $F_2$, and let $n_2$ be very large. Then the pooled distribution $F = F_2$, $d(u)$ equals $d(u; F_2, F_1)$, and $p_1$ tends to 0. The kernel probability density estimator of $f_1$ has asymptotic variance

$$\lim_{n\to\infty} (n_1/M)\,\mathrm{Var}[\hat d(u; F_2, F_1)] = K^*\,d(u; F_2, F_1)$$

since

$$p(u) = p_1f_1(F^{-1}(u))/f(F^{-1}(u)) = p_1d(u; F_2, F_1).$$